Privacy Preserving Web Query Log Publishing: A 
Survey on Anonymization Techniques 



Amin Milani Fard 

Simon Fraser University, Burnaby, Canada 
University of British Columbia, Vancouver, Canada 
milanifard@cs.sfu.ca 



ABSTRACT 

Releasing Web query logs, which contain valuable
information for research or marketing, can breach the
privacy of search engine users. Therefore, rendering
query logs to limit linking a query to an individual
while preserving the usefulness of the data for analysis
is an important research problem. This survey provides
an overview and discussion of recent studies in this
direction.

1. INTRODUCTION 

Web search queries are generally stored by search 
engines for the purpose of improving result ranking, 
query refinement, user modeling, language-based ap- 
plications, and sharing data for academic research 
or commercial needs [10]. On the other hand, releasing
such data without proper anonymization can seriously
breach the privacy of search engine users. In 2006, the
America Online (AOL) query log data of 650k users over
three months was released after removing all explicit
identifiers of searchers, as shown in Figure 1 [6].
Shortly after that, searcher No. 4417749 was traced back
to the 62-year-old widow Thelma Arnold, who lives in
Lilburn. This scandal made data publishers reluctant to
provide researchers with public anonymized query logs
[22].

Since then, an important research problem has been how
to render Web query log data so that linking a query to
a specific individual is limited while the data remains
useful for analysis.

1.1 Privacy-Preserving Data Publishing

Researchers in the field of privacy-preserving data
publishing focus on designing techniques to publish
data that is as useful as possible while preserving the
privacy of individuals [19]. Publishing data instead of
publishing data mining results is much more useful and
interesting because many other analyses can be done on
such data. Thus the published data should be potentially
useful for many data analysis objectives, which makes
privacy-preserving data publishing challenging.

The process of anonymization [11], [13] refers to
hiding the identity (or sensitive information) of
individuals. Removing explicit identifiers (such as
name) is not effective since non-identifying personal
data (such as age, gender, zipcode) can be combined
with publicly available data to identify an individual
[38]. The combination of such non-explicit identifiers
is called the quasi-identifier (QI) attributes [13],
which could be used to identify an individual with some
sensitive attribute (SA) such as his or her disease.

1.1.1 Data Privacy Attacks 

The most common privacy threats are record linkage,
attribute linkage, and table linkage, in which an
attacker tries to link an individual to a record in a
published table, to a sensitive attribute in a published
table, or to the published data table itself,
respectively [19].

In record/attribute linkage, the attacker knows that
the victim's record is in the released table. If the
victim's QI values match only a small number of
records, the victim can be distinguished with high
probability. In table linkage, however, the attacker
does not know whether the victim's record exists in the
released table and tries to determine its presence or
absence. Some other privacy models are not concerned
with such linkage attacks, but with the change in the
attacker's probabilistic belief about the SA value of a
victim after seeing the published data [19].



1.1.2 Data Privacy Models 

We explain some well-known approaches to prevent
privacy attacks. The notion of k-anonymity [36] is a
solution to record linkage attacks, where the QI of
each record should be the same as that of at least
$k-1$ other records. This ensures that the probability
of linking an individual to a specific record based on
QI is at most $1/k$.
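As a rough illustration of the k-anonymity definition above, the following sketch groups records by their QI values and checks the size of every group. The table layout, the attribute names, and the function name `is_k_anonymous` are assumptions made for this example and do not come from the cited work.

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """Return True if every QI combination is shared by at least k records."""
    groups = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(size >= k for size in groups.values())

# Hypothetical generalized table: age ranges and truncated zipcodes form the QI.
table = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]

print(is_k_anonymous(table, ["age", "zip"], 2))  # every QI group has 2 records
print(is_k_anonymous(table, ["age", "zip"], 3))  # no QI group reaches 3 records
```

With k = 2 the check succeeds, so an attacker who knows a victim's QI values cannot narrow the match to fewer than two records.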

Figure 1: Sample query log from AOL released data [3]

Figure 2: A simple taxonomy tree

As a solution to the attribute linkage attack, the
$\ell$-diversity notion [31] requires each group of
records with the same QI to have at least $\ell$
"well-represented" SA values. This ensures that there
are at least $\ell$ distinct values for the SA in each
such group, and thus automatically satisfies
k-anonymity with $k = \ell$. However, $\ell$-diversity
cannot prevent attribute linkage attacks if the overall
distribution of an SA is skewed. As a solution, the
notion of t-closeness [29] requires the distribution of
a sensitive attribute in any group on QI to be close to
the distribution of the attribute in the overall table.
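The group-wise requirement can be sketched as follows. This example assumes the simplest reading of "well-represented", namely that each QI group contains at least a given number of distinct SA values; other instantiations are not shown. The table and the name `is_distinct_l_diverse` are hypothetical.

```python
from collections import defaultdict

def is_distinct_l_diverse(records, qi_attrs, sa_attr, l):
    """Return True if every QI group holds at least l distinct SA values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in qi_attrs)].add(r[sa_attr])
    return all(len(values) >= l for values in groups.values())

# Hypothetical table that is 2-anonymous on (age, zip).
table = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]

# The second group contains only "flu", so an attacker learns the disease
# of anyone in that group even though each group has two records.
print(is_distinct_l_diverse(table, ["age", "zip"], "disease", 2))
```

The example shows why k-anonymity alone does not stop attribute linkage: the second group has two records but a single SA value.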

The $(\rho_1, \rho_2)$-privacy notion [18] guarantees
that if the attacker's prior belief about an SA value
before the data release is at most $\rho_1$, then after
seeing the released data his posterior belief is bounded
by $\rho_2$, where $0 < \rho_1 < \rho_2 < 1$.



The notion of $\epsilon$-differential privacy [15]
guarantees that the addition or removal of a single
record in the database will not significantly change
the results of statistical analysis. It thereby assures
record owners that submitting their personal
information to the database exposes them to little
additional privacy risk.

1.1.3 Data Anonymization Techniques 

We explain three major techniques to guarantee 
privacy notions. 

Generalization and Suppression: In suppression we
delete some values, and in generalization we replace
some values with less specific ones. For
generalization, we replace categorical attribute values
according to a given taxonomy, such as the one shown in
Figure 2. Values in numerical attributes are usually
replaced with an interval containing the original
values.

In full-domain generalization [37], [38], all values in
an attribute are generalized to the same level of the
taxonomy tree. For example, with the taxonomy in Figure
2, if Beef and Chicken are generalized to Meat, then
Apple, Orange and Banana should be generalized to
Fruit. In subtree generalization [7], [24], either all
child nodes or none are generalized. For example, in
Figure 2 this scheme requires that if Beef is
generalized to Meat, then the other child node,
Chicken, would also be generalized to Meat, but Apple
and Orange, which are child nodes of Fruit, can remain
ungeneralized. In cell generalization [41], also known
as "local recoding", only some instances of a value are
generalized, in contrast to "global recoding", in which
if a value is generalized, all its instances are
generalized.
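To make the full-domain scheme concrete, the sketch below lifts every value of a column to the same taxonomy level. Figure 2 is not reproduced here, so the taxonomy is an assumption reconstructed from the item names used in the text (Food as the root, with Meat and Fruit below it); the helper names are hypothetical.

```python
# Child -> parent links of an assumed taxonomy (root "Food" at depth 0).
PARENT = {
    "Beef": "Meat", "Chicken": "Meat",
    "Apple": "Fruit", "Orange": "Fruit", "Banana": "Fruit",
    "Meat": "Food", "Fruit": "Food",
}

def depth(node):
    """Number of edges between the node and the root."""
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d

def generalize_to_level(value, level):
    """Climb the taxonomy until the value sits at the requested depth."""
    while depth(value) > level:
        value = PARENT[value]
    return value

def full_domain_generalize(values, level):
    """Full-domain generalization: every value is lifted to the same level."""
    return [generalize_to_level(v, level) for v in values]

column = ["Beef", "Apple", "Chicken", "Banana"]
print(full_domain_generalize(column, 1))  # ['Meat', 'Fruit', 'Meat', 'Fruit']
```

Subtree and cell generalization would relax this by lifting only one subtree, or only some occurrences of a value, respectively.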

Major suppression techniques are record suppression
[37] and cell suppression (or local suppression) [11],
which are processes of suppressing an entire record, or
suppressing some instances of a given value in a
database, respectively.

Anatomization and Permutation: In anatomization [40],
the QI and the SAs are not modified. Instead, the QI
data and the SA data are published in two separate
tables: a QI table containing the quasi-identifier
attributes and an SA table containing the sensitive
attributes, where both tables share one common GroupID
attribute. In the permutation method [44], records are
partitioned into groups and then their SA values are
shuffled within each group.

Perturbation and Randomization: In perturbation, the
original data values are replaced with synthetic data
values in such a way that the statistical information
is preserved. The additive noise technique [1][17]
alters sensitive numerical data such as salary by
adding a random value drawn from some distribution. The
data swapping method, in which SA values of records are
exchanged, can protect numerical and categorical
attributes [35]. The authors of [18] also proposed a
randomization approach based on data swapping to limit
the attacker's background knowledge for inferring
sensitive attributes.

1.2 Contributions and Paper Organization 

In this survey, we provide an overview of the re- 
cent studies in privacy-preserving Web query log 
publishing. We explain privacy notions, attacks, 
and the utility challenges in query log anonymiza- 
tion. We categorize the recent privacy-preserving 
query log publishing techniques into transactional 
and non-transactional anonymity approaches. 

The rest of the paper is organized as follows. 
In Section 2, we study the problem of query log 
anonymization and its challenges. We categorize 
the existing anonymization methods in Section 3
and summarize and discuss these methods in Section
4. We conclude the paper in Section 5.



2. PRIVACY PRESERVING WEB QUERY 
LOG DATA PUBLISHING 

The problem of Web query-log anonymization has been
examined by [27], [2], and [10] from the Web community,
with a focus on privacy attacks, and by [17], [16],
[21], [39], [43], [23], [?6], [33], [30], and [9] from
the database community, with a focus on transaction
database anonymization. In this survey we study both
groups of work and categorize them into
non-transactional and transactional anonymity models,
respectively. However, the major part belongs to the
transactional model, meaning that we treat query logs
as transaction data (unstructured data without a fixed
set of attributes), where each transaction represents a
query and each item represents a query term. Such data
is a rich source for many data mining applications such
as association rule mining and search recommendations.

2.1 Privacy Attacks on Published Query Logs

The authors of [12] discussed possible attacks on
published Web queries. Some sensitive information, such
as social security numbers and credit card numbers, can
be obtained directly from the query content.
Demographic and geographical information such as
location could also help an attacker find the identity
of a user. Even following the URLs of pages that a user
clicked can reveal the user's identity when combined
with his/her other queries. Finally, once a user ID has
been identified, the adversary can easily discover all
of the user's private queries by looking at the entire
search history.

2.2 Challenges in Anonymizing Query Logs 

As mentioned earlier, query log data can be seen 
as a special case of transaction data, where each 
transaction contains several "items" from an item 
universe. This item universe is typically very large, 
say thousands of items (such as catalog items in 
Amazon.com), and each transaction contains only 
a few items. If each item is treated as a binary 
attribute with 1/0 values, the transaction data is 
extremely high-dimensional and sparse. There are 
two main groups of challenges in anonymizing such 
high-dimensional data. 

2.2.1 Data Utility Challenge 

Query log (or transaction data) anonymization aims at
preserving privacy while maintaining data utility and
reducing information loss. However, it is not always
clear how to measure the utility of anonymized query
logs. In the case of suppression methods, the
information loss can be a simple count of suppressed
items. For generalization-based techniques, various
metrics have been proposed to measure the quality of
generalized data, including the classification metric,
the generalized loss metric [24], and the
discernibility metric [7]. Some loss measures specific
to transaction anonymization are the normalized
certainty penalty [23] and group generalization
distortion [33]. Itemset-based utility [20] is another
utility measure, which captures frequent itemsets in
transaction data.

Apart from the utility measures mentioned above, [20]
mentioned two other aspects of the practical usefulness
of the anonymized data. The first is the truthfulness
of results, i.e., the analysis results (such as the
support of a frequent itemset) on the anonymized data
also hold on the original data. The second is value
exclusiveness, i.e., the items in the modified data are
exclusive of each other. This has a significant impact
on many data mining tasks based on counting queries.
For example, the local recoding transformation [28]
does not have this property. Considering Figure 2, a
local recoding can generalize some occurrences of
"Apple" and some occurrences of "Orange" to "Fruit".
It is then not possible to count the number of
transactions containing "Apple" or "Orange" from the
modified data.

The major challenge for all query-log anonymization
methods is reducing the significant information loss of
the anonymized data. Each dimension (any search term)
could be potentially sensitive and a potential QID
attribute used for record or attribute linkage, so
employing traditional privacy models such as
k-anonymity would require including all dimensions in a
single QID. Consequently, a lot of data has to be
suppressed or generalized to the top-most values in
order to satisfy k-anonymity, even for small values of
k [19]. Although removing sensitive terms based on the
semantics of the search term and its context can help
increase the utility of the anonymized data, the
removed sensitive terms can still be predicted from the
user's other queries [25].



2.2.2 Data Privacy Challenge 

Anonymized query log data has some privacy issues which
are even more important than the above utility issues.
Firstly, assuming an adversary with very strong
background knowledge can drastically reduce the utility
of the anonymized data. Therefore, some researchers
consider a bounded adversary with limited background
knowledge (e.g., bounded by a maximum number of items)
[20]. Although this assumption can be realistic, it
does not hold for an unbounded adversary, in which case
privacy is breached.

Secondly, as discussed in [12], an adversary can create
multiple accounts and generate many queries using those
accounts to create special query patterns (such as many
infrequent queries, or a distinguishable signature), so
that, when the search log is sanitized and released,
the adversary can use those patterns to obtain private
information about other users. Such issues are still
not well studied.

3. WEB QUERY LOG ANONYMIZATION 
TECHNIQUES 

We classify the query log anonymization methods into
two groups which model query logs differently. The
first group of works deals with query logs almost as
is, while the second group treats query logs as a
special case of high-dimensional transaction data. In
this section we briefly explain representative previous
works in each group. There might be other very recent
works which are, to some extent, incremental variations
of these methods.

3.1 Non-Transactional Anonymity Models

3.1.1 Query Deletion and Hashing 

Seven privacy-enhancing techniques for query logs were
discussed in [10], including deleting entire query
logs, hashing query log content, deleting user
identifiers, deleting personal information in query
content, hashing user identifiers, shortening sessions,
and deleting infrequent queries.

Log deletion is the most privacy-enhancing tech- 
nique; however, the utility of data drops to zero. 
Hashing queries is also not safe since other publicly 
available data, such as previously released query 
logs, or search engine statistics about queries in un- 
hashed form, can be used to pinpoint an individual. 
Similarly, hashing identifiers cannot guarantee elim- 
inating the risk of privacy breach. 

Even after removing identifying information it 
may still be possible to link queries back to individ- 
uals by using other publicly available information. 
Although shortening sessions can be highly privacy- 
protective, due to removal of the link between a user 
and his/her entire query history, the query content 
may still contain identifying information, and thus 
the risks from accidental and malicious disclosure 
will not be totally resolved. In addition, query logs 
with short sessions are less useful for analysis. 

Deleting queries that appear infrequently in the logs
was suggested in [2] as an effective way of removing
queries that contain identifying information. Setting a
threshold for being "infrequent" is, however, very
challenging. Studies have also shown that a large
number of queries in huge query log datasets occur only
a small number of times [8]. Consequently, this
approach may lead to the deletion of a considerable
number of non-identifying queries.

3.1.2 Token based Hashing 

In token based hashing [27], a query is anonymized by
tokenizing it into terms and securely hashing each
token to an identifier. One major problem with this
technique is that if an unanonymized reference query
log has been released previously, the adversary could
apply co-occurrence analysis and frequency analysis to
the reference query log to extract statistical
properties of query terms, and then process the
anonymized log to invert the hash function based on
co-occurrences of tokens within queries. For example,
an adversary who knows how often the query "HIV
treatment" appears in a previously released log can use
these statistics to decipher the separate hashes for
"HIV" and "treatment".
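The weakness can be illustrated with a small sketch. It hashes tokens with a keyed hash and then guesses their meaning purely by aligning frequency ranks with a reference log; the co-occurrence analysis mentioned above is not shown. The key, the 12-character truncation, the toy logs, and all function names are assumptions made for this example.

```python
import hashlib
import hmac
from collections import Counter

def hash_token(token, key):
    """Keyed hash of a single query term (truncated for readability)."""
    return hmac.new(key, token.encode("utf-8"), hashlib.sha256).hexdigest()[:12]

def anonymize(queries, key):
    """Tokenize each query on whitespace and hash every token."""
    return [[hash_token(tok, key) for tok in q.split()] for q in queries]

def frequency_attack(hashed_log, reference_log):
    """Guess token meanings by aligning frequency ranks of the two logs."""
    hashed_counts = Counter(tok for q in hashed_log for tok in q)
    ref_counts = Counter(tok for q in reference_log for tok in q.split())
    hashed_ranked = [t for t, _ in hashed_counts.most_common()]
    ref_ranked = [t for t, _ in ref_counts.most_common()]
    return dict(zip(hashed_ranked, ref_ranked))

key = b"secret-key-known-only-to-the-publisher"
private_log = ["hiv treatment", "hiv clinic", "treatment options", "hiv"]
reference_log = ["hiv treatment", "hiv test", "hiv symptoms", "treatment cost"]

released = anonymize(private_log, key)
guess = frequency_attack(released, reference_log)
print(guess[hash_token("hiv", key)])        # hiv
print(guess[hash_token("treatment", key)])  # treatment
```

The adversary never learns the key; the frequent tokens are recovered only because their frequency ranks match those in the reference log, while rare tokens remain ambiguous under this simple attack.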

3.1.3 Secret Sharing and Split Personality 

The secret sharing [2] anonymization method splits a
query into k random shares and publishes a new share
for each distinct user issuing the same query. This
technique guarantees k-anonymity because each share is
useless on its own and all the k shares are required to
decode the secret. This means that a query can be
decoded only when there are at least k users issuing
that query. The result is equivalent to suppressing all
queries issued by fewer than k users. Since queries are
typically sparse, many queries will be suppressed as a
result.
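The sketch below does not implement secret sharing itself. It only illustrates the equivalent outcome stated above: a query survives only if at least k distinct users issued it. The log format of (user, query) pairs and the function name are assumptions.

```python
from collections import defaultdict

def suppress_rare_queries(log, k):
    """Keep only entries whose query was issued by at least k distinct users."""
    users_per_query = defaultdict(set)
    for user, query in log:
        users_per_query[query].add(user)
    return [(u, q) for u, q in log if len(users_per_query[q]) >= k]

# Hypothetical log of (user id, query) pairs.
log = [
    (1, "weather"), (2, "weather"), (3, "weather"),
    (1, "thelma arnold lilburn"),
    (2, "cheap flights"), (2, "cheap flights"),
]

print(suppress_rare_queries(log, 2))
```

Only "weather" remains for k = 2; "cheap flights" is dropped even though it appears twice, because both occurrences come from the same user.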

Split personality, also proposed in [2], focuses on
reducing the possibility of reconstructing the search
history of a user by splitting the logs of each user
based on the user's "interests". For example, if a user
is interested in both Sport and Art, then he will have
two different profiles, one for the queries about
Sport and the other for the queries related to Art. In
this way, the users become dissimilar to themselves;
however, the distortion makes it more difficult for
researchers to correlate different facets. The authors
provided no formal privacy guarantee for this method.

3.2 Transactional Anonymity Models 

The following works focused on transaction data 
anonymization. However, some also mentioned query 
log as a special case and used query log data in their 
experiments. 

3.2.1 Randomization Methods 

An early approach to transaction anonymization applied
randomization methods, where some items are replaced
with others and some "false" items that look like
"true" items are inserted into a transaction [17].
Given a transaction $t$, the anonymized transaction
$t'$ is generated in three steps. The randomization
operator first chooses a number $j$ at random with some
probability, then selects $j$ items from $t$ uniformly
at random (without replacement) and places them into
$t'$. Finally, it considers each item not in $t$ and
adds it to $t'$ with some probability.
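A minimal sketch of such a randomization operator follows. It assumes that the number of retained items is itself drawn at random from a distribution supplied by the caller, and that false items are added independently with a fixed probability. The parameter names `size_probs` and `rho` and the function name are hypothetical.

```python
import random

def randomize_transaction(t, universe, size_probs, rho, rng=random):
    """Randomize transaction t; size_probs[j] weights keeping j true items."""
    items = sorted(t)
    # Step 1: draw how many true items to keep, j in 0..len(t).
    j = rng.choices(range(len(items) + 1), weights=size_probs, k=1)[0]
    # Step 2: keep j true items chosen uniformly without replacement.
    t_prime = set(rng.sample(items, j))
    # Step 3: add each item outside t independently with probability rho.
    for item in sorted(set(universe) - set(t)):
        if rng.random() < rho:
            t_prime.add(item)
    return t_prime

universe = {"a", "b", "c", "d", "e", "f"}
t = {"a", "b", "c"}
rng = random.Random(0)  # seeded only to make the example repeatable
print(randomize_transaction(t, universe, [0.1, 0.2, 0.3, 0.4], 0.5, rng))
```

An observer of the output cannot tell whether a given item was truly in the transaction or inserted in the third step, which is the source of the privacy protection.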

In another randomization approach [18], the authors
proposed $(\rho_1, \rho_2)$-privacy, which guarantees
that if the attacker's prior belief about a property
$Q(t)$ of a transaction $t$ before the data release is
at most $\rho_1$, then after seeing the released
randomized transaction $t'$ his posterior belief is
bounded by $\rho_2$, where $0 < \rho_1 < \rho_2 < 1$.
They presented a method for finding the perturbation
probabilities that maximize the expected value of
$|t \cap t'|$ while ensuring $(\rho_1, \rho_2)$-privacy.

3.2.2 Coherence Method 



The coherence method [43] eliminates both record
linkage attacks and attribute linkage attacks. The
$(h, k, p)$-coherence privacy criterion requires that
any subset of at most $p$ non-sensitive items that
occurs in the data is contained in at least $k$
transactions, and that at most $h$ percent of these
transactions contain any common sensitive item. This
ensures that, for an attacker with power $p$, the
probability of linking an individual to a transaction
is limited to $1/k$ and the probability of linking an
individual to a sensitive item is limited to $h$.

Let $\beta$ denote the adversary's background knowledge
that a transaction contains some non-sensitive items.
An attack is modeled in the form $\beta \rightarrow e$,
where $e$ is a sensitive item. Let $Sup(\beta)$ denote
the support of $\beta$, i.e., the number of
transactions containing $\beta$. Then
$P(\beta \rightarrow e) = Sup(\beta \cup \{e\}) / Sup(\beta)$
is the probability that a transaction contains $e$
given that it contains $\beta$. The breach probability
of $\beta$, denoted by $P_{breach}(\beta)$, is the
maximum $P(\beta \rightarrow e)$ over all private items
$e$. Assume an adversary's background knowledge is up
to $p$ non-sensitive items, i.e., $|\beta| \le p$. If
$Sup(\beta) < k$, the adversary is able to link an
individual to a transaction (record linkage attack),
and if $P_{breach}(\beta) > h$, the adversary is able to
link an individual to a sensitive item (attribute
linkage attack).
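The two quantities used in this attack model can be computed directly from their definitions. The sketch below does so on a toy database; the data, the item names, and the function names `support` and `breach_probability` are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Sup(beta): number of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def breach_probability(transactions, beta, sensitive_items):
    """P_breach(beta): the largest P(beta -> e) over all sensitive items e."""
    sup_beta = support(transactions, beta)
    if sup_beta == 0:
        return 0.0
    return max(
        (support(transactions, set(beta) | {e}) / sup_beta for e in sensitive_items),
        default=0.0,
    )

# Toy database: "HIV" is the only sensitive item.
D = [{"a", "b", "HIV"}, {"a", "b"}, {"a", "c"}, {"a", "b", "HIV"}]
sensitive = {"HIV"}

print(support(D, {"a", "b"}))                        # 3
print(breach_probability(D, {"a", "b"}, sensitive))  # 2 of the 3 contain HIV
print(support(D, {"c"}))                             # 1
```

With k = 2 and h = 0.5, knowing {a, b} enables an attribute linkage attack because the breach probability 2/3 exceeds h, while knowing {c} enables a record linkage attack because its support is below k.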

A mole is any background knowledge (of size at most
$p$) that can result in a linking attack. Coherence
aims at eliminating all moles. For a setting of
$(h, k, p)$, an itemset $\beta$ with $|\beta| \le p$ and
$Sup(\beta) > 0$ is called a mole if $Sup(\beta) < k$ or
$P_{breach}(\beta) > h$. The data $D$ is
$(h, k, p)$-coherent if $D$ contains no moles. The
authors of [43] applied the total item suppression
technique to enforce $(h, k, p)$-coherence. Total
suppression of an item refers to deleting the item from
all transactions containing it. Although total
suppression results in high information loss when the
data is sparse, it has two nice properties: (1) it
eliminates all moles containing the suppressed item,
and (2) it keeps the support of any remaining itemset
equal to its support in the original data. The latter
implies that any result derived from the modified data
also holds on the original data, which is not the case
for partial suppression.

Since finding an optimal solution to
$(h, k, p)$-coherence, i.e., one with minimum
information loss (suppressed items), is NP-hard [43],
the authors proposed a heuristic solution. They defined
minimal moles as those moles that contain no proper
subset that is itself a mole; removing the minimal
moles is sufficient for removing all moles. An
algorithm similar to the well-known Apriori algorithm
for mining frequent itemsets was used to find all
minimal moles.
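The search for minimal moles can be sketched with a level-wise enumeration that proceeds from small itemsets to larger ones and skips any candidate containing an already found mole. This is a brute-force illustration of the idea, not the algorithm of [43]: it enumerates candidates with `itertools.combinations` rather than with Apriori-style candidate generation, and the data and function name are assumptions.

```python
from itertools import combinations

def find_minimal_moles(transactions, sensitive, h, k, p):
    """Return all minimal moles of size <= p among the non-sensitive items."""
    public = sorted(set().union(*transactions) - set(sensitive))
    minimal = []
    for size in range(1, p + 1):
        for cand in combinations(public, size):
            beta = frozenset(cand)
            if any(m <= beta for m in minimal):
                continue  # contains a smaller mole, so it cannot be minimal
            sup = sum(1 for t in transactions if beta <= t)
            if sup == 0:
                continue  # itemsets that never occur are not moles
            breach = max(
                (sum(1 for t in transactions if beta <= t and e in t) / sup
                 for e in sensitive),
                default=0.0,
            )
            if sup < k or breach > h:
                minimal.append(beta)
    return minimal

D = [{"a", "b", "HIV"}, {"a", "b"}, {"a", "c"}, {"a", "b", "HIV"}]
moles = find_minimal_moles(D, {"HIV"}, h=0.5, k=2, p=2)
print(sorted(sorted(m) for m in moles))  # [['b'], ['c']]
```

In this toy database, totally suppressing the items b and c removes every mole, since each larger mole contains one of the two minimal ones.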

One problem with the coherence method is scalability,
given the exponential growth in the number of itemsets.
In a similar work, [42] suggested using sets of maximal
and minimal itemsets, called borders. These borders are
typically much smaller than the full sets of itemsets
that they represent, so their solution requires much
less space and time.

3.2.3 Band Matrix Approach 

An anonymization method is proposed in [21] to prevent
attribute linkage attacks for high-dimensional data
with sensitive items, using a band matrix technique. In
a band matrix, non-zero entries are confined to a
diagonal band, with zero entries on either side. In
such a matrix, rows correspond to transactions and
columns correspond to items, with a 0/1 value in each
entry. In their method, items are divided into
sensitive items and non-sensitive items. A
non-sensitive transaction is a transaction with no
sensitive items, and a sensitive transaction is one
with at least one sensitive item.

A transaction set $T$ has privacy degree $p$ if the
probability of associating any transaction $t \in T$
with a particular sensitive item does not exceed $1/p$.
To achieve this privacy requirement, [21] suggested
applying two phases: (1) transforming the data to a
band matrix (using the Reverse Cuthill-McKee algorithm)
with respect to non-sensitive attributes, and (2)
grouping each sensitive transaction with non-sensitive
transactions or with sensitive ones that have different
sensitive items. The intuition behind the band matrix
formation is that it organizes the data such that
consecutive transactions are very likely to share many
common non-sensitive items, which results in a smaller
reconstruction error.

In the second phase, a greedy algorithm based on the
"one-occurrence-per-group" heuristic was proposed in
[21], which allows only one occurrence of each
sensitive item in a group.

3.2.4 $k^m$-Anonymity

To address record linkage attacks in transaction data,
[39] proposed the $k^m$-anonymity notion, which assumes
that any subset of items can be used as background
knowledge. In this method, unlike the coherence and
band matrix approaches, items are not distinguished as
sensitive or non-sensitive; every item is considered
both a potential quasi-identifier and potentially
sensitive. It assumes, like the coherence method, that
an adversary knows at most $m$ items as background
knowledge.

A transaction database is $k^m$-anonymous if for any
set of up to $m$ items, there exist at least $k$
transactions that contain those items in the published
database. We can consider this privacy notion as a
special case of $(h, k, p)$-coherence with $h = 100\%$
and $p = m$, meaning that a subset of items that causes
a violation of $k^m$-anonymity is a mole under the
coherence model.
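The definition can be checked by counting the support of every itemset of size at most m that occurs in the data. The sketch below assumes that only itemsets appearing in at least one transaction count as possible violations; the toy data and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """True if every occurring itemset of size <= m is in at least k transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, min(m, len(items)) + 1):
            counts.update(combinations(items, size))
    return all(c >= k for c in counts.values())

D = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "c"}, {"b", "c"}]

print(is_km_anonymous(D, k=2, m=1))  # every single item occurs at least twice
print(is_km_anonymous(D, k=2, m=2))  # the pair (b, c) occurs only once
```

The example shows how the guarantee weakens as the assumed adversary knowledge m grows: the same database is safe against one known item but not against two.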



The anonymization method in [39] applies generalization
in the form of a global recoding scheme in which, when
a child node is generalized to its parent, all its
sibling nodes are also generalized to their parent
node, and the generalization is applied to all
transactions in the database. Each generalization
corresponds to a possible horizontal cut of the
taxonomy tree. The information loss of a cut is
measured using the normalized certainty penalty loss
metric [41], which captures the degree of
generalization of an item $i$ by considering the
percentage of leaf nodes under $i$ in the item
taxonomy.
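A possible reading of this per-item penalty is sketched below: an ungeneralized (leaf) item costs nothing, and a generalized item costs the fraction of all leaf items that lie beneath it. The exact form of the metric in [41] may differ, and the taxonomy is an assumption reconstructed from the item names used in the text.

```python
# Assumed taxonomy (Figure 2 is not reproduced here).
CHILDREN = {
    "Food": ["Meat", "Fruit"],
    "Meat": ["Beef", "Chicken"],
    "Fruit": ["Apple", "Orange", "Banana"],
}

def leaves_under(node):
    """All leaf items in the subtree rooted at the node."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return [node]
    return [leaf for kid in kids for leaf in leaves_under(kid)]

def ncp(node, root="Food"):
    """Penalty of an item: 0 for a leaf, else its share of all leaf items."""
    if node not in CHILDREN:
        return 0.0
    return len(leaves_under(node)) / len(leaves_under(root))

for item in ["Apple", "Meat", "Fruit", "Food"]:
    print(item, ncp(item))
```

Under this reading, generalizing to Fruit (three of five leaves) is penalized more than generalizing to Meat (two of five), and generalizing to the root costs the maximum of 1.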

If a cut results in a $k^m$-anonymous database, then
all its more general cuts also result in a
$k^m$-anonymous database. This is called the
monotonicity property of cuts [39]. The
$k^m$-anonymization problem is to find a
$k^m$-anonymous transformation with the minimum
information loss. Based on the monotonicity property,
and in order to prevent higher information loss, as
soon as we find a cut that satisfies the
$k^m$-anonymity constraint, we do not have to look for
a more general cut.

Generating the set of all possible cuts and checking
the anonymity violation for every subset of up to $m$
items is not feasible for large, realistic problems.
Thus, the authors proposed a greedy heuristic method
called Apriori anonymization (AA), which is based on
the apriori principle: if an itemset $J$ of size $i$
violates the anonymity requirement, then each superset
of $J$ also violates the anonymity requirement. It
explores the space of itemsets in an apriori, bottom-up
fashion. This means that before checking whether
$\ell$-itemsets ($\ell = 2, \ldots, m$) violate the
anonymity requirement, we first eliminate the possible
anonymity violations caused by $(\ell-1)$-itemsets.
This method drastically reduces the number of itemsets
that must be checked at a higher level, since detailed
items could have been generalized.

3.2.5 Transactional k-Anonymity

The assumption of bounded adversary background
knowledge in the coherence and $k^m$-anonymity methods
has two limitations. Firstly, in many cases it is not
possible to determine this bound in advance. Secondly,
these methods can ensure k-anonymity (with $p$ or $m$
set to the maximum transaction length in the database)
only if the adversary's background knowledge is limited
to the presence of items. If the background knowledge
is about the "absence" of items, the adversary may
exclude transactions and focus on fewer than $k$
transactions. For example, consider an adversary who
knows that Bob has bought "Orange" and "Chicken", but
has not bought "Milk". Suppose that three transactions
contain "Orange" and "Chicken", two of which also
contain "Milk". The adversary can exclude the two
transactions containing "Milk" and link the remaining
transaction to Bob. Here, $k^m$ privacy with $k = 2$
and $m = 3$ is violated, even when $m$ is set to the
maximum transaction length.
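The Bob example can be replayed in a few lines. The sketch filters the published transactions first by the items known to be present and then also by the item known to be absent; the function name `candidates` is hypothetical.

```python
# The three transactions from the example above.
transactions = [
    {"Orange", "Chicken", "Milk"},
    {"Orange", "Chicken", "Milk"},
    {"Orange", "Chicken"},
]

def candidates(transactions, present, absent):
    """Transactions consistent with knowledge of present and absent items."""
    return [t for t in transactions if present <= t and not (absent & t)]

presence_only = candidates(transactions, {"Orange", "Chicken"}, set())
with_absence = candidates(transactions, {"Orange", "Chicken"}, {"Milk"})

print(len(presence_only))  # 3
print(len(with_absence))   # 1
```

Knowledge of presence alone leaves three candidates, but adding the single absent item isolates one transaction, which is exactly the linkage that a bound on known present items does not prevent.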

The k-anonymity approach in [23], which we refer to as
the Partition method, avoids this problem since all
transactions in the same equivalence class are
identical. They extended the original k-anonymity for
relational data [37], [38] to transactional k-anonymity
for "set-valued data", in which a set of values is
associated with an individual. A transaction database
$D$ is k-anonymous if every transaction in $D$ occurs
at least $k$ times. The authors of [23] showed that
every database which satisfies k-anonymity also
satisfies $k^m$-anonymity for all values of $m$;
however, the reverse does not always hold.
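Transactional k-anonymity is simple to verify, since it only requires counting identical transactions. The sketch below treats each transaction as a set, as in the set-valued model described above; the generalized data and the function name are assumptions.

```python
from collections import Counter

def is_transactional_k_anonymous(transactions, k):
    """True if every distinct transaction occurs at least k times."""
    counts = Counter(frozenset(t) for t in transactions)
    return all(c >= k for c in counts.values())

# Hypothetical generalized database.
generalized = [
    {"Fruit", "Meat"}, {"Fruit", "Meat"},
    {"Fruit"}, {"Fruit"}, {"Fruit"},
]

print(is_transactional_k_anonymous(generalized, 2))  # each transaction repeats
print(is_transactional_k_anonymous(generalized, 3))  # {Fruit, Meat} occurs twice
```

Because the transactions in an equivalence class are identical, neither the presence nor the absence of an item can separate them.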

The Partition method is an extended version of the
top-down Mondrian algorithm [2?] for relational data.
In this method, if several items are generalized to the
same item, only one occurrence of the generalized item
is kept in the generalized transaction. It starts with
a single partition containing all transactions, with
all items generalized to the root item. Then it
recursively splits a partition by specializing a node
in the taxonomy for all the transactions in the
partition. Next, all the transactions in the partition
with the same specialized item are distributed to the
same sub-partition. At the end of the distribution,
small sub-partitions with fewer than $k$ transactions
are merged into a special leftover sub-partition to be
redistributed. The partitioning stops when a further
split would violate the k-anonymity condition. Unlike
Apriori anonymization [39], the Partition approach
follows a local recoding scheme.

3.2.6 Clustering-Based k-Anonymity 

The Partition method suffers from significant
information loss for two reasons. Firstly, it stops
partitioning the data at a high level of the taxonomy
because the exponential branching for generating
sub-partitions quickly diminishes the size of a
sub-partition and causes k-anonymity violations. This
is especially true for query logs with a large and
diverse item universe. Secondly, it does not deal with
item duplication in the generalized transaction. In
fact, preserving term frequency (as much as possible)
is an important issue for many applications, such as
TF-IDF used by ranking algorithms.

The authors of [33] adopted the privacy notion of
transactional k-anonymity [23] and proposed a
clustering approach to query log anonymization as a
solution to the above shortcomings of the Partition
method. The main idea in [33] is to group "similar"
transactions together, to reduce the amount of
generalization and suppression required to make them
identical. For example, the generalized transaction
for <Apple> and <Milk> is <Food>, and for <Apple> and
<Orange> is <Fruit>. Clearly the former entails more
information loss. Therefore, transaction anonymization
can be treated as a clustering problem such that each
cluster must contain at least $k$ transactions and
these transactions should be "similar".
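The two generalizations in this example correspond to the lowest common ancestor of the items in the taxonomy. The sketch below covers only this single-item case and is not the generalization algorithm of [33]. The taxonomy is an assumption, including the placement of Milk under a hypothetical Dairy node.

```python
# Child -> parent links of an assumed taxonomy rooted at "Food".
PARENT = {
    "Beef": "Meat", "Chicken": "Meat",
    "Apple": "Fruit", "Orange": "Fruit", "Banana": "Fruit",
    "Milk": "Dairy",
    "Meat": "Food", "Fruit": "Food", "Dairy": "Food",
}

def ancestors(item):
    """The item followed by its ancestors, ordered from the item up to the root."""
    path = [item]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lowest_common_ancestor(a, b):
    """Most specific taxonomy node that generalizes both items."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("items do not share a taxonomy root")

print(lowest_common_ancestor("Apple", "Orange"))  # Fruit
print(lowest_common_ancestor("Apple", "Milk"))    # Food
```

Clustering <Apple> with <Orange> therefore stops at Fruit, whereas clustering it with <Milk> forces generalization to the root, which is why grouping similar transactions reduces information loss.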

They defined a transaction as a bag of items (thus
allowing duplicate items). A transaction $t'$ is a
generalized transaction of a transaction $t$ if each
item $i' \in t'$ represents (the generalization of) one
"distinct" item $i \in t$. This transaction model
differs from [23] in two ways. First, it allows
duplicate items in a transaction and in its generalized
transaction. For example, if $t'$ = <Fruit, Fruit> is a
generalized transaction of $t$, then $t'$ represents
two leaf items under Fruit in $t$. Second, it allows
items in a transaction to be on the same path in the
item taxonomy while each item represents a distinct
leaf item. For example, we interpret the transaction
<Fruit, Food> as follows: Fruit represents a leaf item
under Fruit, and Food represents a leaf item under Food
that is not represented by Fruit.

The least common generalization (LCG) was proposed
as a way to measure the similarity of a subset of
transactions. The LCG of a set of transactions S is
a common generalized transaction for all of the
transactions in S such that no other common
generalized transaction is more specific. The authors
devised an efficient linear-time bottom-up item
generalization algorithm to compute the LCG. They also
proposed group generalization distortion (GGD) as a
measure that captures both the generalization and the
suppression distortion of a set of transactions. They
formulated transaction anonymization as the problem
of clustering a set of transactions into clusters of
size at least k such that the sum of the GGD of the
LCG of all clusters is minimized. Since the problem is
NP-hard, they presented a heuristic linear-time
algorithm, called Clump, which, unlike Partition,
preserves duplicate items after generalization.

3.2.7 Heuristic Generalization with Heuristic Suppression

The authors in [30] were motivated by the limitations
of k^m-anonymity and proposed to integrate the
global generalization technique in [39] with the total
item suppression technique in [43] for enforcing
k^m-anonymity. They applied the full subtree
generalization technique [39], meaning that a
generalization solution Cut is defined by a cut on a
taxonomy tree with exactly one item on every
root-to-leaf path. Since full subtree generalization
can suffer from excessive distortion in the presence
of outliers, suppressing a few outlier items reduces
the information loss caused by a high amount of
generalization. They applied the total item
suppression technique, which removes some items of
Cut from all transactions. The loss metric aggregates
both generalization and suppression loss.

The anonymized data is derived in two steps: first
the items are generalized with respect to the Cut,
and then some items of the Cut are suppressed in all
transactions. The number of cuts for a taxonomy is
exponential in the number of items, and enumerating
the suppression scenarios for a cut is also
intractable. Consequently, the authors provided a
heuristic approach to address this issue.
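
The two-step derivation above can be sketched as follows. This is a minimal illustration, not the heuristic of [30]: the taxonomy, the chosen cut, the suppressed item, and all function names are hypothetical, and the k^m-anonymity check simply counts the support of every itemset of size at most m.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taxonomy: child -> parent (the root maps to None).
PARENT = {
    "All": None,
    "Food": "All", "Travel": "All",
    "Fruit": "Food", "Dairy": "Food",
    "Apple": "Fruit", "Orange": "Fruit", "Milk": "Dairy",
    "Flight": "Travel", "Hotel": "Travel",
}


def generalize(item, cut):
    """Map an item to its ancestor-or-self that lies on the cut."""
    node = item
    while node is not None:
        if node in cut:
            return node
        node = PARENT[node]
    raise ValueError(f"{item!r} is not covered by the cut")


def anonymize(transactions, cut, suppressed):
    """Step 1: generalize to the cut. Step 2: drop suppressed cut items everywhere."""
    result = []
    for t in transactions:
        generalized = {generalize(i, cut) for i in t}  # set semantics: duplicates collapse
        result.append(frozenset(generalized - suppressed))
    return result


def is_km_anonymous(transactions, k, m):
    """True if every occurring itemset of size <= m is in at least k transactions."""
    support = Counter()
    for t in transactions:
        for size in range(1, m + 1):
            for itemset in combinations(sorted(t), size):
                support[itemset] += 1
    return all(count >= k for count in support.values())


data = [
    {"Apple", "Milk"},
    {"Orange", "Milk"},
    {"Apple", "Flight"},
]
cut = {"Fruit", "Dairy", "Travel"}  # one node on every root-to-leaf path
released = anonymize(data, cut, suppressed={"Travel"})
print(released)
print(is_km_anonymous(released, k=2, m=2))
```

In this toy data the generalized item Travel occurs in only one transaction. Suppressing it is enough to satisfy the check for k = 2 and m = 2, without generalizing everything further up to the root.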

3.2.8 Sketch-based Anonymization 

The sketch-based privacy-preserving approach [3]
reduces the dimensionality of the data by producing
a much smaller number of features to represent the
data. This technique is especially effective for
high-dimensional sparse data such as query logs.
The idea is to replace a user's search history by a
set of sketches. Two privacy criteria associated with
this technique are δ-anonymity and k-variance.

δ-anonymity ensures that the uncertainty in the
reconstructed value of each term frequency is at
least δ. As noted in [3], a disadvantage of
δ-anonymity is that it treats each user independently,
regardless of whether there are other users similar
to him/her. The authors argued that it is desirable
to give outliers (users who use unique terms) more
protection than users who are similar to many others.
Thus, they defined k-variance, which ensures that any
user's sanitized search history cannot be easily
distinguished from its k nearest neighbors. They
described algorithms for δ-anonymity and k-variance
using suppression.
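
To make the idea of replacing a search history by a few sketches concrete, the following is a minimal sketch that uses a pseudo-random +1/-1 projection of a term-frequency vector. The construction, the names `make_projection` and `sketch`, and the toy sizes are assumptions made for illustration; the exact sketch definition and the δ-anonymity and k-variance algorithms of [3] are not reproduced here.

```python
import random


def make_projection(num_terms, num_sketches, seed):
    """Draw pseudo-random +1/-1 values, one row per sketch component."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(num_terms)]
            for _ in range(num_sketches)]


def sketch(term_frequencies, projection):
    """Replace a term-frequency vector by a short vector of signed sums."""
    return [sum(r * x for r, x in zip(row, term_frequencies))
            for row in projection]


# A sparse search history over a vocabulary of 1000 terms (toy data).
history = [0] * 1000
history[3], history[250], history[999] = 2, 1, 4

projection = make_projection(num_terms=1000, num_sketches=8, seed=7)
print(sketch(history, projection))
```

The 1000-dimensional history is reduced to 8 numbers. Fewer sketch components leave more uncertainty about each original term frequency, which is the kind of trade-off the two privacy criteria above are meant to control.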

3.2.9 Semantic Microaggregation 

The anonymization method in [16] clusters the
queries and then replaces the original queries by the
centroids of the corresponding clusters, taking the
semantics of the queries into account. The authors
argued that creating a cluster with queries from
users with different "interests" can result in
useless protected logs, and thus the queries of users
with common interests should be grouped in the same
cluster.

They used the Open Directory Project to compute
the semantic distances between users and to partition
the queries into groups of k users with similar
interests. In the aggregation phase, they compute a
new user as the representative (or centroid) of the
cluster, which summarizes the queries of all the users
of the cluster. The query items for the centroid are
selected by a probabilistic approach based on the
contribution of each user with respect to the number
of transactions in the cluster.

3.2.10 Differential Privacy 

Differential privacy [15] is one of the state-of-the-art
techniques for ensuring privacy and is considered
more robust to attacks than other existing privacy
definitions. The notion of differential privacy was
applied to search queries in [26], which adds random
noise to any statistic of a search log, such as a
term frequency. This random noise is drawn
independently from the Laplace distribution with mean
zero and a given scale parameter. The algorithm
output contains frequent queries with noisy
statistics of the queries and the clicked URLs.
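
The noisy-count idea can be sketched as follows. This is only an illustration under stated assumptions: the `scale` and `threshold` parameters and the function names are hypothetical, and the sketch omits the bound on each user's contribution and the calibration of the noise that an actual differential privacy guarantee requires, so it is not the algorithm of [26].

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_frequent_queries(log, scale, threshold, seed=None):
    """Return {query: noisy count} for queries whose noisy count exceeds threshold.

    log is a list of (user_id, query) pairs. scale and threshold are
    hypothetical tuning parameters, not values taken from [26].
    """
    rng = random.Random(seed)
    counts = Counter(query for _, query in log)
    released = {}
    for query, count in counts.items():
        noisy = count + laplace_noise(scale, rng)
        if noisy > threshold:
            released[query] = noisy
    return released


# Toy log: one popular query and one query issued by a single user.
log = [(u, "weather") for u in range(50)] + [(1, "rare personal query")]
print(noisy_frequent_queries(log, scale=2.0, threshold=20.0, seed=1))
```

With these toy parameters the popular query is very likely to be released with a perturbed count, while the query issued once is very unlikely to pass the threshold.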

3.2.11 p-Uncertainty 

The privacy notion p-uncertainty [9] ensures that
the confidence of any sensitive association rule is at
most p, while truthful association rules can still be
derived. Like the works in [21] and [43], the authors
distinguish between public (non-sensitive) and private
(sensitive) items. Formally, a p-uncertain transaction
set D does not allow an attacker who knows any subset
of a transaction t ∈ D to infer a sensitive item in t
with confidence higher than p.

The authors proposed a technique that combines
global generalization over non-sensitive items and
selective global suppression of some items. This
notion is similar to (h, k, p)-coherence; however, the
p-uncertainty model allows an adversary with some
prior knowledge of the private items.
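
The p-uncertainty condition can be checked by brute force on a small transaction set. The sketch below assumes the usual definition of rule confidence, conf(X -> s) = support(X ∪ {s}) / support(X), and considers only non-empty antecedents X; the data and function names are hypothetical. The enumeration is exponential in the transaction length, so it serves only as an illustration and not as the method of [9].

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def max_sensitive_confidence(transactions, sensitive):
    """Largest confidence of a rule X -> s, with s sensitive and X a
    non-empty subset of a transaction that contains s."""
    worst = 0.0
    for t in transactions:
        for s in t & sensitive:
            rest = sorted(t - {s})
            for size in range(1, len(rest) + 1):
                for x in combinations(rest, size):
                    antecedent = frozenset(x)
                    conf = (support(antecedent | {s}, transactions)
                            / support(antecedent, transactions))
                    worst = max(worst, conf)
    return worst


def is_p_uncertain(transactions, sensitive, p):
    """True if no sensitive rule has confidence above p."""
    return max_sensitive_confidence(transactions, sensitive) <= p


data = [
    frozenset({"flu symptoms", "pharmacy", "weather"}),
    frozenset({"pharmacy", "weather"}),
    frozenset({"pharmacy", "news"}),
    frozenset({"weather", "news"}),
]
sensitive = frozenset({"flu symptoms"})
print(max_sensitive_confidence(data, sensitive))
print(is_p_uncertain(data, sensitive, p=0.4))
```

Here an attacker who knows that a user searched for both "pharmacy" and "weather" infers the sensitive item with confidence 1/2, so the toy data is p-uncertain for p = 0.5 but not for p = 0.4.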

4. SUMMARY AND DISCUSSION 
4.1 Privacy Preservation 

In this survey we considered two models for query
log anonymization: the non-transactional model and
the transactional model. Although the techniques of
the non-transactional model in Section 3.1 protect
privacy to some extent, they lack a formal privacy
guarantee. For example, the release of the AOL query
log still led to the re-identification of a search
engine user even after hashing user identifiers [6].
This is because the query content itself may be used
together with publicly available information for
linking attacks.

In the transactional model we consider Web query
logs as unstructured transaction data and therefore
approach query log anonymization from the point of
view of transaction database anonymization. Such
modeling, however, might not be ideal, since there
are strong correlations between keywords within a
query (based on natural language) and between queries
within a single session, which is not true of
transactions in general. Utilizing these correlations
can help develop better solutions for the problem in
the future.

Among the previous work in transaction
anonymization, the coherence [43] approach can
prevent both record linkage attacks and attribute
linkage attacks. The band matrix [21] and
p-uncertainty [9] approaches can prevent attribute
linkage attacks, and both k^m-anonymization [39] [30]
and k-anonymization [23] [33] prevent record linkage
attacks.

Neither k^m-anonymization nor k-anonymization
distinguishes between sensitive and non-sensitive
data; instead, both treat items as potential QI and
SA. In fact, determining which items are sensitive
is not always possible in real applications, given
the huge size of the item universe. The adversary's
background knowledge is bounded in coherence and
k^m-anonymization, while band matrix and
k-anonymization do not limit the attacker's
knowledge. A security issue with bounded knowledge in
coherence and k^m-anonymization was explained in [23]:
if the background knowledge is on the "absence" of
items, the attacker may exclude transactions using
this knowledge and focus on fewer than k transactions.
The HgHs approach also has this privacy issue.

For the sketch-based privacy-preserving approach
[3], the authors in [12] argued that one should be
careful about releasing the (pseudo)randomly
generated values that were used in the sanitization
process in [3], since this may allow attackers to
reconstruct the original data, which is a privacy
breach.

While applying differential privacy to search
queries [26] is very promising, like every existing
privacy definition it is susceptible to active
attacks. The assumption that users behave honestly
may lead to a privacy breach. If an attacker creates
multiple accounts and issues a private query, such as
someone else's credit card number, among the first
queries of some of these accounts, this private data
could be published by the search engine.

4.2 Utility Preservation 

As discussed in Section 2.2.1, important utility 
factors for the anonymized data are item general- 
ization/suppression loss, truthfulness, itemset util- 
ity, value exclusiveness, and item frequency. 

The authors in [42] and [43] assume that the
taxonomy tree for transaction data tends to be flat
with a large fanout, and thus decided to use item
suppression instead of generalization. In this case,
employing generalization loses more information than
employing item suppression. However, if the
transaction database is too sparse, then the item
suppression of the coherence approach may cause a
large information loss. If the data is sparse and the
taxonomy is "slim" and "tall", the generalization
scheme in k^m-anonymization and k-anonymization could
work better, while if the taxonomy is "short" and
"wide", generalization causes larger information loss
[20] [33].



Data analysis on anonymized data is considered
truthful with respect to the original data if the
analysis results obtained from the modified data hold
on the original data. The coherence and k^m-anonymity
approaches guarantee truthful analysis, while this is
not the case for k-anonymization.

The analysis of frequent itemsets [5], i.e., the
items that co-occur frequently in transactions, is
widely used in data mining applications such as
association rule mining and search recommendations.
Thus preserving itemsets is an important utility
factor. Among the discussed approaches, coherence and
band matrix can preserve such itemset utility.

The local generalization in the transactional
k-anonymity approach has a smaller information loss
than global generalization; however, the anonymized
data does not have value exclusiveness, which is
important to preserve for many data mining
algorithms. This means that new algorithms must be
designed to analyze such data.

Most of the previous works in transaction data
anonymization do not deal with item duplication,
meaning that the frequency of a term in a query
cannot be preserved well, which affects utilities
such as count query results. For example, the
information loss of the k-anonymity method in [23]
can be high due to item generalization and the
elimination of duplicate generalized items. The
latter source of information loss is not measured by
the usual information loss metrics for relational
data, where no attribute value is eliminated by
generalization. The authors in [33], however,
designed their anonymization method so that it
preserves item frequency.

There is no guarantee of minimum data distortion
in the semantic microaggregation technique [16] when
computing the centroids of the clusters. Moreover,
the authors did not consider item generalization and
its cost in their model. For the sketch-based
privacy-preserving approach [3], it would be
interesting to see whether it is useful for
anonymizing real search logs and, when only sanitized
search logs are available, what kinds of search log
analysis can still be conducted with acceptable
accuracy.

5. CONCLUSION 

Publishing Web query logs for research or marketing
is restricted by privacy concerns, and achieving a
suitable trade-off between privacy and utility of
query log data is a challenging problem. We surveyed
some recent studies on query log anonymization and
categorized them into two groups based on how they
treat queries. Most works consider query logs as
transaction data and apply techniques to guarantee a
desired level of privacy. While there is progress in
privacy preservation of published query logs,
preserving data utility is still a challenging issue.